
RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code

Gautam, Dhruv, Garg, Spandan, Jang, Jinu, Sundaresan, Neel, Moghaddam, Roshanak Zilouchian

arXiv.org Artificial Intelligence

Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. Solving tasks within RefactorBench requires thorough exploration of dependencies across multiple files and strong adherence to relevant instructions. Each task is defined by three natural language instructions of varying specificity, and tasks are mutually exclusive, allowing longer combined tasks to be composed on the same repository. Baselines on RefactorBench reveal that current LM agents struggle with simple compositional tasks, solving only 22% of tasks with base instructions, in contrast to a human developer who, under short time constraints, solves 87%. Through trajectory analysis, we identify various unique failure modes of LM agents and further explore the failure mode of tracking past actions. By adapting a baseline agent to condition on representations of state, we achieve a 43.9% improvement in solving RefactorBench tasks. We further extend our state-aware approach to encompass entire digital environments and outline potential directions for future research. RefactorBench aims to support the study of LM agents by providing a set of real-world, multi-hop tasks within the realm of code.

"Repetition is the root of all software evil" -- Martin Fowler

Large language models (LLMs) have been quickly acquiring new capabilities (Bubeck et al., 2023), leading toward the adoption of AI-powered systems in various formats and domains. The increasing use of LLM-powered tools like GitHub Copilot has greatly improved the capability of developers in software development tasks (Peng et al., 2023). More recently, an emphasis on multi-step execution through LLM feedback loops has unlocked the ability to solve harder problems within a variety of fields (Reed et al., 2022; Sumers et al., 2024; Yao & Narasimhan, 2023), including parts of software engineering. This new paradigm of solving larger software tasks has led to the construction of a variety of new automated software engineering (ASE) systems, most of them structured as LM agents (Wang et al., 2024c; Cognition.ai). Evaluations for such systems are currently drawn largely from real-world data on GitHub (Jimenez et al., 2024; LaBash et al., 2024). While GitHub is the strongest open-source signal for software engineering tasks at scale, its snapshot nature makes it inherently noisy, requiring heavy filtering and validation testing for reliable evaluations (Chowdhury et al., 2024; Bowman & Dahl, 2021).
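The state-conditioning idea summarized above can be pictured as an agent loop that carries a compact record of its own past actions into every prompt, rather than relying only on the latest tool output. The sketch below is a minimal illustration under assumed names (`AgentState`, `call_llm`, the `edit_file`/`finish` tools); it is not RefactorBench's or the paper's actual interface.

```python
# Hypothetical sketch of a state-aware agent loop: each step is conditioned on
# a summary of past actions and touched files. All names are illustrative.
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Running record of what the agent has already done."""
    edited_files: set = field(default_factory=set)
    actions: list = field(default_factory=list)

    def summary(self) -> str:
        # Compact textual representation of state injected into each prompt.
        return (f"Files edited so far: {sorted(self.edited_files)}\n"
                f"Recent actions: {self.actions[-10:]}")


def call_llm(prompt: str) -> dict:
    # Placeholder for a real LM call; returns a tool invocation.
    return {"tool": "finish", "args": {}}


def run_state_aware_agent(task: str, max_steps: int = 20) -> AgentState:
    state = AgentState()
    for _ in range(max_steps):
        prompt = f"Task: {task}\n\nCurrent state:\n{state.summary()}\n\nNext action?"
        action = call_llm(prompt)
        if action["tool"] == "finish":
            break
        # Record the action so later steps can avoid repeating or undoing it.
        state.actions.append(action)
        if action["tool"] == "edit_file":
            state.edited_files.add(action["args"].get("path", ""))
    return state


if __name__ == "__main__":
    final = run_state_aware_agent("Rename a helper function across all modules")
    print(final.summary())
```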


Statistical Rejection Sampling Improves Preference Optimization

Liu, Tianqi, Zhao, Yao, Joshi, Rishabh, Khalman, Misha, Saleh, Mohammad, Liu, Peter J., Liu, Jialu

arXiv.org Artificial Intelligence

Improving the alignment of language models with human preferences remains an active research challenge. Previous approaches have primarily utilized Reinforcement Learning from Human Feedback (RLHF) via online RL methods such as Proximal Policy Optimization (PPO). Recently, offline methods such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have emerged as attractive alternatives, offering improvements in stability and scalability while maintaining competitive performance. SLiC refines its loss function using sequence pairs sampled from a supervised fine-tuned (SFT) policy, while DPO directly optimizes language models based on preference data, foregoing the need for a separate reward model. However, the maximum likelihood estimator (MLE) of the target optimal policy requires labeled preference pairs sampled from that policy. DPO's lack of a reward model constrains its ability to sample preference pairs from the optimal policy, and SLiC is restricted to sampling preference pairs only from the SFT policy. To address these limitations, we introduce a novel approach called Statistical Rejection Sampling Optimization (RSO) that aims to source preference data from the target optimal policy using rejection sampling, enabling a more accurate estimation of the optimal policy. We also propose a unified framework that enhances the loss functions used in both SLiC and DPO from a preference modeling standpoint. Through extensive experiments across three diverse tasks, we demonstrate that RSO consistently outperforms both SLiC and DPO on evaluations from both Large Language Model (LLM) and human raters.
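The rejection-sampling step at the core of the approach can be pictured as follows: draw candidate responses from the SFT policy, score them with a reward model, and accept each with probability exp((r - r_max) / beta), which tilts the accepted set toward the reward-weighted (estimated optimal) policy. The sketch below is a minimal illustration with assumed helper names (`sft_sample`, `reward`) and placeholder scores, not the paper's implementation.

```python
# Hypothetical sketch of rejection sampling toward an estimated optimal policy:
# candidates come from the SFT policy and survive with probability
# exp((r - r_max) / beta), so accepted samples approximate the reward-tilted policy.
import math
import random


def sft_sample(prompt: str) -> str:
    # Placeholder for sampling a response from the supervised fine-tuned policy.
    return random.choice(["response A", "response B", "response C"])


def reward(prompt: str, response: str) -> float:
    # Placeholder reward model score r(x, y).
    return {"response A": 0.1, "response B": 0.7, "response C": 0.4}[response]


def rejection_sample(prompt: str, beta: float = 0.5, n_candidates: int = 64) -> list:
    candidates = [sft_sample(prompt) for _ in range(n_candidates)]
    rewards = [reward(prompt, y) for y in candidates]
    r_max = max(rewards)
    accepted = []
    for y, r in zip(candidates, rewards):
        # Higher-reward responses are accepted more often, tilting the SFT
        # samples toward the estimated optimal policy.
        if random.random() < math.exp((r - r_max) / beta):
            accepted.append(y)
    return accepted


if __name__ == "__main__":
    print(rejection_sample("Explain rejection sampling in one sentence."))
```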


Score-Based Generative Modeling with Critically-Damped Langevin Diffusion

Dockhorn, Tim, Vahdat, Arash, Kreis, Karsten

arXiv.org Machine Learning

Score-based generative models (SGMs) have demonstrated remarkable synthesis quality. SGMs rely on a diffusion process that gradually perturbs the data towards a tractable distribution, while the generative model learns to denoise. The complexity of this denoising task is, apart from the data distribution itself, uniquely determined by the diffusion process. We argue that current SGMs employ overly simplistic diffusions, leading to unnecessarily complex denoising processes, which limit generative modeling performance. Based on connections to statistical mechanics, we propose a novel critically-damped Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior performance. CLD can be interpreted as running a joint diffusion in an extended space, where the auxiliary variables can be considered "velocities" that are coupled to the data variables as in Hamiltonian dynamics. We derive a novel score matching objective for CLD and show that the model only needs to learn the score function of the conditional distribution of the velocity given data, an easier task than learning scores of the data directly. We also derive a new sampling scheme for efficient synthesis from CLD-based diffusion models. We find that CLD outperforms previous SGMs in synthesis quality for similar network architectures and sampling compute budgets. We show that our novel sampler for CLD significantly outperforms solvers such as Euler-Maruyama. Our framework provides new insights into score-based denoising diffusion models and can be readily used for high-resolution image synthesis. Project page and code: https://nv-tlabs.github.io/CLD-SGM.
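A coupled data-velocity forward diffusion of the kind described above can be sketched with a simple Euler-Maruyama loop: noise is injected only into the velocity, and the data moves solely through its coupling to the velocity. The coefficients and the critical-damping choice Gamma = 2*sqrt(M) below are illustrative assumptions in the spirit of the abstract, not the paper's exact parameterization.

```python
# Minimal sketch of a coupled data-velocity (damped Langevin) forward diffusion.
# The data x is perturbed only through its coupling to the auxiliary velocity v,
# while Gaussian noise enters through v alone.
import numpy as np


def cld_forward(x0, n_steps=1000, dt=1e-3, beta=4.0, mass=0.25, seed=0):
    rng = np.random.default_rng(seed)
    gamma = 2.0 * np.sqrt(mass)          # critical damping (illustrative choice)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)                 # auxiliary "velocity" variables
    for _ in range(n_steps):
        # Euler-Maruyama step of the coupled SDE: noise only hits v.
        dx = beta * (v / mass) * dt
        dv = (-beta * x - beta * gamma * v / mass) * dt \
             + np.sqrt(2.0 * gamma * beta * dt) * rng.standard_normal(x.shape)
        x, v = x + dx, v + dv
    return x, v


if __name__ == "__main__":
    xT, vT = cld_forward(np.array([1.0, -0.5]))
    print(xT, vT)
```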


Adversarial Variational Inference and Learning in Markov Random Fields

Li, Chongxuan, Du, Chao, Xu, Kun, Welling, Max, Zhu, Jun, Zhang, Bo

arXiv.org Machine Learning

Markov random fields (MRFs) find applications in a variety of machine learning areas, but inference and learning in such models are challenging in general. In this paper, we propose the Adversarial Variational Inference and Learning (AVIL) algorithm to solve these problems with minimal assumptions about the structure of the MRF. AVIL employs two variational distributions to approximately infer the latent variables and estimate the partition function, respectively. The variational distributions, which are parameterized as neural networks, provide an estimate of the negative log likelihood of the MRF. On one hand, the estimate takes the intuitive form of an approximate contrastive free energy. On the other hand, obtaining it amounts to a minimax optimization problem, which is solved by stochastic gradient descent in an alternating manner. We apply AVIL to various undirected generative models in a fully black-box manner and obtain better results than existing competitors on several real datasets.
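The alternating minimax optimization mentioned above can be pictured as two optimizers taking turns on a shared scalar estimate: one ascends in the partition-function network, the other descends in the model and inference networks. The objective below is a placeholder stand-in, and all names (`nll_estimate`, `train_avil`) are hypothetical; the actual AVIL contrastive free-energy estimate is defined in the paper.

```python
# Hypothetical sketch of alternating minimax training with two variational networks.
# The objective here is a runnable placeholder, not the AVIL free-energy estimate.
import torch
import torch.nn as nn


def nll_estimate(model, infer_net, partition_net, batch):
    # Placeholder for the approximate contrastive free energy.
    return (model(batch) + infer_net(batch) - partition_net(batch)).mean()


def train_avil(model, infer_net, partition_net, loader, steps=10, lr=1e-3):
    opt_min = torch.optim.SGD(list(model.parameters()) + list(infer_net.parameters()), lr=lr)
    opt_max = torch.optim.SGD(partition_net.parameters(), lr=lr)
    data = iter(loader)
    for step in range(steps):
        batch = next(data)
        opt_min.zero_grad()
        opt_max.zero_grad()
        loss = nll_estimate(model, infer_net, partition_net, batch)
        if step % 2 == 0:
            # Ascent step on the partition-function network (maximize the estimate).
            (-loss).backward()
            opt_max.step()
        else:
            # Descent step on the model and inference networks.
            loss.backward()
            opt_min.step()


if __name__ == "__main__":
    make_net = lambda: nn.Sequential(nn.Linear(4, 1))
    loader = [torch.randn(8, 4) for _ in range(20)]
    train_avil(make_net(), make_net(), make_net(), loader)
```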